Abstract

Active learning aims to obtain a classifier of high
accuracy with fewer label requests than passive learning
by selecting effective queries. Many active learning
methods have been developed in the past two decades, which
sample queries based on the informativeness or
representativeness of unlabeled data points. In this work,
we explore a novel querying criterion based on statistical
leverage scores. The statistical leverage score of a row
of a matrix is the squared norm of the corresponding row
of the matrix containing its (top) left singular vectors,
and it measures the influence of that row on the matrix.
Leverage scores have been used for detecting highly
influential points in regression diagnostics (Chatterjee &
Hadi, 1986) and have recently been shown to be useful for
data analysis (Drineas et al., 2008) and randomized
low-rank matrix approximation algorithms (Gittens &
Mahoney, 2013). We explore how sampling data instances
with high statistical leverage scores performs in active
learning. Our empirical comparison on several binary
classification datasets indicates that querying
high-leverage points is an effective strategy.

1. Introduction 

A passive supervised learning algorithm for classification
induces a model from the available set of labeled instances.
However, in many modern machine learning applications,
in addition to this limited set of labeled instances, there is
a large pool of unlabeled instances. For cases where the
cost of labeling data is high relative to that of collecting
unlabeled data, active learning strategies have been shown
to be useful. In a classical active learning framework for
supervised classification (Cohn et al., 1994; Settles, 2009),
the learner can interact with an oracle (e.g., a human
annotator) that provides labels when queried. Typically, an
active learner begins with a small set of labeled instances,
selects one or a batch of examples from a pool of unlabeled
data, and queries the labels for these selected examples.
Once the oracle provides the new labels, these examples are
added to the training set, the active learner is retrained,
and this process is repeated until a halting criterion (e.g.,
a desired accuracy) is satisfied. By selectively deciding
which examples to label, the active learner aims to obtain
a classifier of high accuracy with fewer label requests,
thereby reducing the total labeling cost. Different
strategies for querying examples have been suggested
(Settles, 2009). In this work, we explore a novel direction
for querying that is based on statistical leverage scores.

Statistical leverage has found extensive application
in regression diagnostics (Chatterjee & Hadi, 1986;
Hoaglin & Welsch, 1978). Statistical leverage scores have
recently been shown to be useful in data analysis methods
such as CUR decomposition and randomized low-rank matrix
approximation algorithms. In CUR decomposition, a matrix
is approximated by a product CUR, where C and R are small
subsets of the columns and rows of the matrix,
respectively, and U is computed from C and R (Drineas et
al., 2006). Drineas et al. (2008) introduced a method in
which the matrix columns are sampled randomly with
probability proportional to their leverage scores.
Similarly, Nyström extensions are sampling-based
randomized low-rank approximations to symmetric positive
semidefinite (SPSD) matrices. Gittens and Mahoney (2013)
analyzed different Nyström sampling strategies for SPSD
matrices and showed that sampling based on leverage scores
is quite effective.

In the aforementioned work, leverage scores were used for
approximation purposes. The intuition in these methods
is that leverage score sampling ensures that important
columns (or rows) are included in the approximation. In this
study, we instead exploit leverage scores to find examples
with important feature vectors in the data and query the
instances with high statistical leverage scores. Our proposed
method, Active Learning by Statistical Leverage Sampling
(ALEVS), exhibits good empirical performance on different
benchmark datasets. The rest of the paper is organized
as follows: in Section 2, we describe the problem setup
and our approach, ALEVS; in Section 3, the experiments
are described in detail; in Section 4, we discuss the
empirical performance of ALEVS on different datasets; and
in Section 5, we summarize the results and state our
conclusions.

2. Problem Set Up and Approach 

2.1. Problem Set Up 

We denote by $\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
the training data set that contains $n$ instances, where each
instance $x_i = [x_{i1}, x_{i2}, \ldots, x_{id}]$ is a vector of
dimension $d$ and $y_i \in \{-1, 1\}$ is the class label of $x_i$.
The initial dataset comprises a small set of labeled examples
and a large pool of unlabeled examples. At each iteration $t$
of active learning, a perfect oracle $\mathcal{O}$ is queried with
an unlabeled example $x_q$, and the oracle returns the label
$y_q$ with uniform cost across examples. We denote the labeled
set of training examples at iteration $t$ by $\mathcal{D}_l^t$ and
the set of unlabeled examples by $\mathcal{D}_u^t$. Our aim is to
attain a classifier $h^*$ of good accuracy with a minimal number
of queried examples.

2.2. ALEVS: Sampling Based on Statistical Leverage 
Scores 

At iteration $t$, the classifier $h_t$ is trained only with the
labeled training examples $\mathcal{D}_l^t$, and the data are
divided into two portions based on class membership. Two
feature matrices are formed. $X_+^t$ is an $m \times d$ feature
matrix whose rows are the feature vectors of the examples with
positive class membership at iteration $t$. These examples are
those that are positively labeled in $\mathcal{D}_l^t$ and those
that are in $\mathcal{D}_u^t$ but have predicted positive labels
according to $h_t$. $X_-^t$ is similarly constructed from the
negatively labeled and negatively predicted examples.

After the labels of the unlabeled data are predicted, ALEVS
computes a kernel matrix over $X_+^t$ and over $X_-^t$ separately.
In our experiments, we employed the linear kernel and the
Gaussian radial basis function (RBF) kernel. Over a set of data
points $x_1, \ldots, x_n \in \mathbb{R}^d$, the linear kernel
matrix $K$ corresponding to those points is given by Eq. 1.
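
As a rough illustration, the two kernels used to build these matrices, the linear kernel (Eq. 1) and the Gaussian RBF kernel (Eq. 2), might be computed as in the NumPy sketch below. The function names `linear_kernel` and `rbf_kernel` and the argument name `sigma` are illustrative assumptions, not the original implementation.

```python
import numpy as np


def linear_kernel(X):
    """Linear kernel (Eq. 1): K[i, j] = <x_i, x_j>."""
    X = np.asarray(X, dtype=float)
    return X @ X.T


def rbf_kernel(X, sigma):
    """Gaussian RBF kernel (Eq. 2): K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 <a, b>.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point round-off.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```

Both functions return an n x n symmetric matrix for an n x d input, which is the form the leverage score computation expects.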


Algorithm 1 ALEVS: Active Learning with Leverage Score Sampling

Input: $\mathcal{D}$, a training dataset of $n$ instances; labeling
oracle $\mathcal{O}$; low-rank parameter $k$; kernel parameters, if any
Output: Classifier $h^*$

Initialize:
    $\mathcal{D}_l^0$    % initial set of labeled instances
    $\mathcal{D}_u^0 \leftarrow \mathcal{D} \setminus \mathcal{D}_l^0$    % the pool of unlabeled instances

repeat
    -Classification-
    Train classifier $h_t$ with training data $\mathcal{D}_l^t$
    Get predicted class labels $\hat{y}_u^t$ by applying $h_t$ on $\mathcal{D}_u^t$

    -Sampling-
    Based on $\hat{y}_u^t$ and $y_l^t$, construct $X_+^t$ and $X_-^t$
    Compute kernel matrix $K_+^t$ on $X_+^t$
    Compute kernel matrix $K_-^t$ on $X_-^t$
    Compute leverage scores on $K_+^t$ using Eq. 4
    Compute leverage scores on $K_-^t$ using Eq. 4
    Get $x_q$ with the highest leverage score in $\mathcal{D}_u^t$
    Query $\mathcal{O}$ for its label $y_q$

    -Update-
    $\mathcal{D}_l^{t+1} \leftarrow \mathcal{D}_l^t \cup \{(x_q, y_q)\}$
    $\mathcal{D}_u^{t+1} \leftarrow \mathcal{D}_u^t \setminus \{x_q\}$
    $t \leftarrow t + 1$
until stopping criterion
$h^* \leftarrow h_t$
Return $h^*$


In Eq. 2, $\sigma$ is a positive real number that
determines the scale of the kernel. The choice of $\sigma$ is
discussed in the experimental section.

As described in (Gittens & Mahoney, 2013), the leverage
scores of an SPSD kernel matrix $K \in \mathbb{R}^{n \times n}$ can
be calculated as follows. Let $K = U \Sigma U^T$ be the
eigendecomposition of $K$. We can partition $U$ as

$$U = (U_1 \quad U_2), \tag{3}$$

where $U_1$ comprises $k$ orthonormal columns spanning the
top $k$-dimensional eigenspace of $K$. The leverage score of
the $j$th column of $K$ is defined as the squared Euclidean
norm of the $j$th row of $U_1$:

$$\ell_j = \|(U_1)_{(j)}\|_2^2. \tag{4}$$
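
A minimal NumPy sketch of Eqs. 3 and 4 follows, assuming the input is a symmetric positive semidefinite matrix. The helper name `leverage_scores` is hypothetical, and the use of `numpy.linalg.eigh` is one reasonable choice among several rather than the authors' stated procedure.

```python
import numpy as np


def leverage_scores(K, k):
    """Rank-k leverage scores of a symmetric PSD matrix K (Eqs. 3-4)."""
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # eigh is intended for symmetric matrices; eigenvectors are the columns.
    eigvals, eigvecs = np.linalg.eigh(K)
    # Indices of the k largest eigenvalues (the top eigenspace, U_1 in Eq. 3).
    top = np.argsort(eigvals)[::-1][:k]
    U1 = eigvecs[:, top]
    # Squared Euclidean norm of each row of U_1 (Eq. 4).
    return np.sum(U1 ** 2, axis=1)
```

Because the columns of $U_1$ are orthonormal, the scores produced this way lie in [0, 1] and sum to $k$, which gives a quick sanity check on any implementation.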


$$K_{ij} = \langle x_i, x_j \rangle. \tag{1}$$

The RBF kernel matrix $K$ corresponding to these same points
is given by

$$K_{ij} = \exp\left(-\frac{\|x_i - x_j\|_2^2}{2\sigma^2}\right). \tag{2}$$

After the leverage scores are computed within each class,
the example to query, $x_q$, is determined by selecting the
unlabeled example with the highest leverage score:

$$x_q = \arg\max_{x_j \in \mathcal{D}_u^t} \ell_j. \tag{5}$$

Steps of ALEVS are summarized in Algorithm 1.
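
One possible reading of the sampling step is sketched below in NumPy. It rests on several assumptions that the text does not settle: a linear kernel is used for brevity, the scores from the two class partitions are compared directly, and $k$ is clamped to the partition size when a partition has fewer than $k$ examples. The names `top_k_leverage` and `select_query` are hypothetical.

```python
import numpy as np


def top_k_leverage(K, k):
    """Squared row norms of the top-k eigenvectors of a symmetric matrix K."""
    eigvals, eigvecs = np.linalg.eigh(K)
    U1 = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return np.sum(U1 ** 2, axis=1)


def select_query(X, labels, is_labeled, k):
    """Index of the unlabeled example with the highest within-class leverage.

    X          : (n, d) feature matrix.
    labels     : (n,) values in {-1, +1}; true labels for labeled rows,
                 predicted labels for unlabeled rows.
    is_labeled : (n,) boolean mask marking rows whose label is already known.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    is_labeled = np.asarray(is_labeled, dtype=bool)
    if is_labeled.all():
        raise ValueError("no unlabeled examples left to query")

    scores = np.full(len(X), -np.inf)
    for cls in (+1, -1):
        idx = np.flatnonzero(labels == cls)
        if idx.size == 0:
            continue
        K = X[idx] @ X[idx].T  # linear kernel on this class partition (assumption)
        scores[idx] = top_k_leverage(K, min(k, idx.size))  # clamp k (assumption)

    # Only unlabeled examples are candidates for querying.
    scores[is_labeled] = -np.inf
    return int(np.argmax(scores))
```

Labeled examples still take part in the kernel matrices, since they belong to the class partitions, but they are masked out before the maximum is taken.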
[Figure 1 panel titles: (d) D vs. P, k = 60, RBF; (e) U vs. V, k = 60, RBF; (h) ringnorm, k = 60, RBF; (i) spambase, k = 60, RBF; (j) 3 vs. 5, k = 20, linear]

Figure 1. Comparison of ALEVS with baselines on classification accuracy.


3. Experiments 

We compare ALEVS with the following baseline
approaches: (1) Random Sampling: selects query
instances at random; (2) Uncertainty Sampling: selects the
instance with maximal uncertainty; (3) Leverage on All:
computes the leverage scores on all of $\mathcal{D}$ at the
beginning of the iteration without paying attention to class
membership and selects unlabeled queries in the order of their
leverage scores. The last baseline tests whether separating the
examples based on their predicted class membership has
any value.

In Uncertainty Sampling, to find the most uncertain
unlabeled data point based on the SVM output, we estimate
the posterior probability of each unlabeled instance with
Platt's algorithm (Platt et al., 1999). The most uncertain
point is the one with maximal $1 - p(\hat{y}_i \mid x_i)$.
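
The selection rule can be illustrated with a short pure-Python sketch. It assumes that the uncertainty of a point is one minus the estimated posterior probability of its predicted label, and that the posterior probabilities of the positive class have already been estimated (for example, by Platt scaling). The function name `most_uncertain` is hypothetical.

```python
def most_uncertain(pos_probs):
    """Index of the example whose predicted label is least certain.

    pos_probs[i] is an estimate of p(y = +1 | x_i) for unlabeled example i.
    """
    best_idx, best_uncertainty = None, -1.0
    for i, p in enumerate(pos_probs):
        # Posterior probability of the predicted (more likely) label.
        p_predicted = max(p, 1.0 - p)
        uncertainty = 1.0 - p_predicted
        if uncertainty > best_uncertainty:
            best_idx, best_uncertainty = i, uncertainty
    return best_idx
```

For a binary problem this picks the point whose estimated posterior is closest to 0.5; for instance, among the estimates 0.9, 0.55, 0.1, and 0.3 it selects the second one.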

Ten different datasets are used in our study; their
descriptions are given in Table 1. The digit1, g241c, g241n, and
USPS datasets are from (Chapelle et al., 2006). The spambase
and letter datasets are from (Lichman, 2013). The letter
dataset is a multi-class dataset; we selected letter pairs
that are difficult to distinguish: letter (D vs. P) and letter
(U vs. V). Similarly, we work on MNIST (3 vs. 5), which
is one of the most confused pairs in the handwritten digit
dataset MNIST (Lecun & Cortes). Finally, the splice and
ringnorm datasets are taken from Gunnar Rätsch's benchmark
datasets (Rätsch et al., 2001). In all experiments, an SVM
with an RBF kernel is used as the classifier. The
RBF kernel scale parameter is selected automatically by the
heuristic method of the built-in SVM function in MATLAB.


Table 1. Datasets and their dimensions.

dataset          # instances   # features
digit1           1500          241
g241c            1500          241
g241n            1500          241
letter (DvsP)    1608          16
letter (UvsV)    1577          16
USPS             1500          241
splice           2991          60
ringnorm         2000          20
spambase         2000          57
MNIST (3vs5)     2000          784


Each dataset is divided into two halves at random. One
half is held out for testing and the other
half is used for training. We start with 4 initially labeled
examples. At each iteration, the classifier is updated for
all methods and the accuracies are calculated on the same
held-out test data. For each dataset, the experiment is
repeated 50 times, and for each replicate the partitioning of
the whole data into training and test sets is random. The
accuracies reported in the figures are the average accuracies
over these random trials, with the shaded area representing the
standard error. In calculating leverage scores, we experimented
with both the RBF and the linear kernel. Here we report the
best-performing cases.
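
The per-query summary statistics described above can be sketched as follows. The sketch assumes the standard error is the sample standard deviation (with an n - 1 denominator) divided by the square root of the number of trials; the text does not state which estimator was used. The function name `mean_and_standard_error` is hypothetical.

```python
import math
import statistics


def mean_and_standard_error(accuracies):
    """Mean and standard error of accuracies collected over repeated trials."""
    n = len(accuracies)
    if n == 0:
        raise ValueError("at least one accuracy value is required")
    mean = statistics.fmean(accuracies)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n);
    # with a single trial the spread is undefined, so 0.0 is returned.
    sem = statistics.stdev(accuracies) / math.sqrt(n) if n > 1 else 0.0
    return mean, sem
```

Applied once per query count to the 50 accuracies obtained at that point of the learning curve, this yields the curve and the width of the shaded band.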

4. Results 

Figure 1 shows the classification accuracy of ALEVS and
the baselines with varied numbers of queries. We observe
that in seven out of ten datasets (Fig. 1a, b, c, g-j), ALEVS
outperforms the baseline methods. In the remaining three
datasets, its performance is comparable to that of Uncertainty
Sampling. On the USPS dataset (Fig. 1f), ALEVS beats Random
Sampling and Leverage on All; however, its performance is only
as good as that of Uncertainty Sampling. On the letter (D vs. P)
(Fig. 1d) and letter (U vs. V) (Fig. 1e) datasets, the initial
performance of ALEVS is very good, but as the number of queries
increases, Uncertainty Sampling outperforms ALEVS. In all
results, at early iterations, ALEVS seems to query better
data points. One strategy could be to start with ALEVS and
switch to another sampling strategy at later iterations.

The Leverage on All baseline achieves performance
between that of ALEVS and Uncertainty Sampling.
This method calculates leverage scores for the kernel matrix
computed over all data, whereas ALEVS first forms
partitions based on class membership. From the results,
we conclude that this division is valuable. It might even be
interesting to further divide the data into clusters and
calculate leverage scores of examples within their own clusters.

We probed the effect of the parameter $k$ on the resulting
performance. In the experiments, we used $k$ values of 20,
40, 60, and 80. For the sake of simplicity, for each dataset
we include results with the best $k$ value. We observe that for
the USPS and splice datasets, $k$ affects the accuracy
drastically. In future work, we will investigate systematic
means to set the parameter $k$ based on the structural
properties of the input matrix.

5. Conclusion 

In this paper, we propose a new method, ALEVS, that samples
data points based on their statistical leverage scores.
The leverage scores are calculated on kernel matrices
constructed from the feature vectors of the instances.
Empirical comparison with baseline methods demonstrates that
sampling high-leverage points is indeed useful. In addition
to the future work discussed in the Results section,
we plan to improve the computational efficiency. Since
the input data matrices to the leverage score computation
overlap across iterations, we will investigate ways of
reusing leverage computations from previous iterations to
calculate the leverage scores for the current iteration.